An Age Dependent Branching Model for Macro evolution 



Stephanie Keller-Schmidt 1 *, Murat Tugrul 2 *, Victor M. Egmluz 2 , 

Emilio Hernandez-Garcfa 2 and Konstantin Klemm 1 
1 Bioinformatics, Institute of Computer Science, University Leipzig 
Hartelstr. 16-18, 04107 Leipzig, Germany 
2 IFISC (CSIC-UIB) Instituto de Fisica Interdisciplinar y Sistemas Complejos, 

E-07122 Palma de Mallorca, Spain 

December 16, 2010 



Abstract 

The imbalance of phylogenetic trees exhibits 
a systematic deviation from the expectation 
of a purely random tree growth process as in 
the Yule or ERM models. Here we introduce 
an age dependent growth model based on the 
hypothesis that speciation rate is a decreasing 
function of the waiting time since the last spe- 
ciation. We find that the imbalance in terms 
of the mean distance of tips from root (Sackin 
index) grows as (logn) 2 in leading order with 
tree size n. This result is in good agreement 
with the trend observed by exhaustive anal- 
ysis of the phylogenetic databases TreeBASE 
and PANDIT. Exact likelihood computation 
of the model on the trees up to 20 tips con- 
tained in the databases is performed. Higher 
likelihoods values are found when compared 
with a previously suggested model [Blum and 
Francois, 2006]. 

keywords: macroevolution, phylogenetic trees, 
branching models, imbalance 

Reconstruction of phylogenies is instrumental for 
an understanding of the driving forces of evolution 
that have led to the diversity of living organisms. Hy- 
potheses about the dynamical rules of speciation and 
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extinction governing all evolutionary processes may 
come under scrutiny with large collections of phy- 
logenetic trees available nowadays [Sanderson et al., 
1994, Whelan et al., 2006]. A suitable starting point 
and null hypothesis is the Equal Rate Markov (ERM) 
process suggesting that species undergo further speci- 
ation at a constant homogeneous rate, independently 
of previous events and other species present. The re- 
sulting topology of the growing tree, which is equiv- 
alent to the one produced by the Yule model [Yule, 
1925], tends to generate compact and nearly balanced 
tree shapes, where shape denotes the simple rooted 
tree without branch lengths nor node labels [Mooers 
and Heard, 1997]. When comparing with the shape 
of observed trees above a certain moderate size, how- 
ever, the ERM hypothesis can be rejected, as most 
real phylogenetic trees are significantly less balanced 
than those generated by the ERM and Yule models 
[Herrada et al., 2008]. Imbalance refers to an uneven 
distribution of the number of taxa (tips) between left 
and right branches of a tree or subtree. 

The so-called beta-splitting [Aldous, 1996] is a class 
of models for stochastic tree generation with expected 
imbalance tunable by a parameter [3 G [— 3/2; +oo[. 
The ERM model is reproduced at ft = 0. The 
model of Proportional to Distinguishable Arrange- 
ments (PDA, f3 = —3/2) assigns to all tree shapes 
of a given size the same probability. The trees in 
TreeBASE have been found to match best with the 
intermediate parameter value (3 = —1.0 [Blum and 



Frangois, 2006] a case called Aldous' Branching (AB) 
model [Aldous, 1996]. The AB model and others 
[Ford, 2006] introduced to account for tree imbalance, 
however, assign probabilities to tree shapes in a way 
which is not based on any evolutionary principles. 
While the model can statistically reproduce features 
of the trees in the databases, it does not hint at any 
biological explanation of these features, as Blum and 
Frangois [2006] remark. 

Here we define a stochastic procedure to grow trees 
by iterative attachment of tips, similarly to the ERM 
model. At difference with the latter, the rate of 
speciation of a given species is assumed to decrease 
with the amount of time since last speciation of that 
species. We show below that this age model yields 
larger or equal likelihoods on small and medium-sized 
trees in the databases (where likelihood computation 
is feasible) when compared with the AB model. Fur- 
thermore, we use the depth (Sackin index) to quantify 
tree imbalance. For the age model, there is evidence 
that the expected depth increases as (logn) 2 with 
the number of tips n. This growth law, identical to 
the AB model, is in good agreement with the depth 
values obtained from the databases. 

Tree Balance 

Several indices for balance measurement have been 
proposed, used and compared in the literature (see 
Mooers and Heard [1997], Matsen [2006], Agapow 
and Purvis [2002] for detailed discussion). Here we 
consider the depth [Sackin, 1972] 

n 

d = n- 1 J2^ (!) 

»=i 

as a measure of imbalance. In a tree with n tips, di 
denotes the number of edges to be traversed to reach 
the root from node i G {!,..., n\. This measure 
may be applied to non-binary trees (including poly- 
tomies and monotomies) and favors analytical treat- 
ment. Comparisons with other measures have been 
made by Matsen [2006], Agapow and Purvis [2002]. 

For a complete binary tree, d = log 2 n since all 
n = 2 k tips are at level k. As the other extreme, a 



comb (or pectinate) tree has nd=l + 2 + -- - + (n — 
2) + 2(n— 1) resulting in asymptotically linear scaling 
d <~ n (we use the notation / ~ g to indicate similar 
asymptotic behavior, i.e. to indicate that for large 
values of n, the ratio f(n)/g(n) tends to a constant; 
an alternative notation is / € ©(g)). 

In the present work we calculated the depth d for 
all trees (and subtrees) in the phylogenetic databases 
TreeBASE (containing species phylogenies, Sander- 
son et al. [1994]) and PANDIT (protein phylogenies, 
Whelan et al. [2006]). The results in Figure 1 sug- 
gest that the average depth of the trees grows with 
the number of tips as 

d~(logn) 2 (2) 

in good approximation. Alternative analytic expres- 
sions for the growth can be fitted, in particular a 
power law d <~ n a , with a rts 0.4 describes TreeBASE 
data equally well [Herrada et al., 2008]. But for the 
larger tree sizes contained in PANDIT, the (log n) 2 
form is more accurate. 

Previous models 

A simple mechanism to generate a binary rooted tree 
is given by the ERM model. It departs from a root 
node counting as a single tip (n = 1). Then in each 
iteration, a node i is drawn from the flat distribution 
on the set of tips, and two tips are attached to node 
i, increasing the number of tips by 1. 

The ERM model provides a mechanism of a grow- 
ing tree. This mechanism may be interpreted as a 
hypothesis for evolutionary dynamics. Here the hy- 
pothesis is that all species are involved in further spe- 
ciation at the equal rates, irrespective of evolutionary 
history. The Yule model [Yule, 1925] is another im- 
plementation of this idea leading to trees with the 
same shape. 

Abstracting from the dynamics behind tree genera- 
tion, one may formulate a model directly in terms of a 
probability distribution on a set of trees. More pre- 
cisely, a probability distribution is given separately 
for each set of all eligible trees of the same given tree 
size n. Here eligible trees are oriented binary rooted 
trees. Oriented is to say that left and right subtrees 
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Figure 1: The square-root of the mean depth vs. size of phylogenetic trees contained in databases 
for species (TreeBASE; empty circles) and proteins (PANDIT; filled circles). The mean depth is av- 
eraged for all trees having the same number of tips. In this scale (log-linear), the behavior (d) ~ 
(logn) 2 is a straight line. Data from TreeBASE has been downloaded from http://www.treebase.org 
on June, 2007 containing 5,212 phylogenetic trees; data from PANDIT has been downloaded from 
http://www.ebi.ac.uk/goldman-srv/pandit on May 2008 and contains 7,738 protein families. 
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are explicitly distinguishable (by a left-right label- 
ing). By this choice we avoid the need to consider 
isomorphy classes w.r.t. left-right symmetry 

In a particular class of models, the probability 
L(T) of a tree T is defined by a product over its 
inner nodes I = {1, 2, . . . , n — 1} according to 



L{T) = JJpmodel(^l^i) 



(3) 



The model-specific probability factor p mo doi depends 
on the total number rn of tips in the subtree with root 
node i and the number of tips in the left subtree of 
i. Arguments naturally fulfill 1 < U < rii. Left-right 
symmetry is ensured by 



Pmodcl («!«•) = Pmodel(rc ~ M) 



(4) 



such that L(Ti) = £(^2) when T\ is isomorphic to 
T2. The choice of the functional form of p m odei 

de- 
termines the expected balance of the trees. By con- 
centrating probability mass at values I close to 2 and 
close to n — 1, imbalance is enforced. 

The ERM model is recovered as the case where all 
77 — I possibilities of left-right splitting are equally 
likely, so 

PYu\e(l\n) = (5) 

77—1 

with equal probability for all 77 — 1 possibilities for 
splitting 77 tips between left and right subtree. The 
expected depth grows as the logarithm of the number 
of tips, 

(d)(n) -log 77 . (6) 

The ERM model is a particular case (/3 = 0) 
of beta-splitting. This is a one-parametric class of 
models with imbalance tunable by the parameter 
Pe [-3/2, oo[. 

1 r(/j + z + i)r(/7 + 77-z + i) 
Mlln) = ^n) r ( / + i)r(n- z + i) (7) 

with suitably chosen normalization ap(n). Taking 
j3 — > 00 produces maximally balanced trees. As the 
opposite extreme, the Proportional to Distinguishable 
arrangements (PDA) model is obtained at /? = —3/2. 
Here the depth grows algebraically with 77 as 



This square-root scaling is obtained also for a dif- 
ferent model of tree growth, the activity model 
[Hernandez-Garcia et al., 2010]. 

The parameter value [3 = —1 is of particular in- 
terest because it has been demonstrated to maximize 
the agreement of beta-splitting with observed phylo- 
genetic trees [Blum and Frangois, 2006] in terms of 
imbalance. This choice (3 = — 1 is also called Aldous' 
branching (AB) model with probabilities 



p_i(/|n) 



1 



a_i(n) l(n — I) 



The expected mean depth increases as 
(d)(n) ~ (Inn) 2 . 



(9) 



(10) 



The age model 

Similar to the ERM model, the age model describes 
the growth of a binary tree by iterative stochastic ad- 
dition of tips. Each tip i is assigned an age Tj(i) being 
the time that passed from the birth of the tip, ti, to 
present time t, i.e. Tj(i) = t— ij. The growth proceeds 
by iterating the following three steps, (i) A tip i is 
chosen with probability Pi(t) inversely proportional 
to age 



Pi(t) = 



C {ty 



(ii) 



<d)(n) 



(8) 



where c(t) is chosen such that probabilities of all tips 
sum up to 1; (ii) Two new tips j and k with creation 
times tj = tk = t are attached to node i; (iii) time 
t is increased by At and the process resumes at (i). 
Here we consider a constant time increment At = 1 
unless indicated otherwise. With this choice, time 
t is equivalent to number of branching events, and 
t = n — 1. A different choice is briefly considered 
below. 

Depth asymptotics 

Now we analyze the 77-depcndcncc of the expected 
depth of trees stochastically generated by the age 
model with At = 1. Numerical and heuristic ar- 
guments strongly suggest that d — (log 77) 2 is the 
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Figure 2: The dependence of depth d on the number of tips n. Curves with symbols are the lower (squares) 
and upper (circles) bounds obtained by the recursion Eqs. (18) and (21) inserted into Eq. (23). Stochastic 
simulations yield an average depth plotted as the solid line with error bars indicating standard deviation 
over the 30 independent realizations with At — 1. Analogously, the dashed line is for stochastic simulations 
but using a time increment At = l/n. Note that y/d is plotted over a logarithmic n-axis, so the dependence 
d ~ (logn) 2 results in a straight line. The inset shows the slopes of the curves in the main panel, which 
display better the asymptotic approach to a constant slope, i.e. the approach to a (logn) 2 growth. 
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asymptotic growth law for this model, but we can 
not provide a fully rigorous demonstration of that. 
Instead we establish here upper and lower bounds for 
the depth in the model, and provide convincing nu- 
merical evidence for the (log n) 2 scaling of them, from 
which the same behaviour would hold for (d)(n). 

Let us first consider a single realization of the 
stochastic process. For each integer time t > 0, let 
8(t) be the distance from root of the two new tips 
added at time t. This means that 8(t) — 1 is the dis- 
tance from root of the tip chosen to speciate. Let 
r(t) be the age of the tip chosen at time t. Then 8(t) 
obeys the recursion 



S(t) = S(t - r(t)) + 1 . 



(12) 



for t > 1 with <5(1) = 1 as initial condition. 

Let us now consider the case that the process has 
generated the sequence 8(1), 8(2), . . . , 8(t — 1) and we 
would like to know the expectation value rj of S(t). 
In the calculation of rj also the distribution /(r, t) of 
ages of the tips of the tree enters as 

tgl+ £a/Mr-'r-» M| 

sr.. /(".<)■»-' 

Let us now establish an /-independent lower-bound 
on rj. To this end, we define a particular age distri- 
bution as 

2t-\ if r = 1 

f<(T,t)={ t~\ if 2 <T<t (14) 

0, ifi<r 

Dynamically, this age distribution is obtained when 
one of the youngest tips (t — 1) is chosen in each step. 
One can show that replacing the actual age distribu- 
tion / by /<, the expected level does not increase. 
Therefore 

,> 1+ ga;,»-^C-^ (15) 

The expectation value (8)(t) over the whole 
stochastic process is obtained formally by an aver- 
age over all histories as follows. Call V t the set of all 
eligible distance sequences of length t — 1 and T t the 



set of all eligible age distributions at time t. Then we 
may write 



(«>(*) = i+E E^/'*) 



rZ\f(r,t)r-'S(t-T) 



(16) 

with p being the joint distribution of distance se- 
quence and age distribution at a given time. An exact 
solution for (S)(t) would thus involve a recursion for 
p, which is difficult to treat. The lower bound on 77 
in Eq. (15), however, is valid for each possible real- 
ization of the process. Therefore 

(17) 

where p' is the marginal of p after summation over 
Tt- Performing the sum over T> t yields 



£a=l/<( (T >*)< 7 ~ 1 



(18) 



Thus we have established a recursion for a lower 
bound on (8). 

Likewise, the age distribution 

2t-\ if t < [t/2\ 
f>(r,t)={ t-\ ifr=(i + l)/2 (19) 
0, otherwise 

can be used to establish an upper-bound recursion. 
Dynamically, this age distribution is obtained when 
an oldest tip is chosen in each step. One can show 
that 



1 , E'^l/^T-V-Mft-T) _ 



El=\f<(^t)a-^ 



(20) 



By arguments analogous to the above, we arrive at 
the upper-bound recursion 

+ Pi) 

For transforming (8) into expected depth d, con- 
sider the sum of distances of tips from root, D(t) = 



G 



td(t). Addition of two tips at distance x from root 
increases D by 2x — (x — 1) = x + 1. Thus 

!>(*)= + (22) 

for a realization of the stochastic process with level 
sequence 6. By linearity of expectation values the 
expected depth is 

<d(t)> + l]/t. (23) 

s=2 

Figure 2 shows upper and lower bounds on the ex- 
pected depth (d) obtained as numerical solutions of 
the recursion equations (18) and (21). In the same 
diagram we plot the results of direct simulations of 
the model. In one set of simulations we use the usual 
time increment At = 1, so that t ~ n. Another set of 
simulations is performed with At = l/n to check for 
robustness under different evolution of overall specia- 
tion rates. Upper and lower bounds as well as the two 
sets of simulations strongly suggest that the asymp- 
totic growth behavior for the depth is (logn) 2 . 

Likelihood Analysis 

Abstracting from the algorithmic formulation of tree 
generation, a branching model A can be character- 
ized by the probability La(T) of obtaining a given 
tree T. The quantity La(T) is also called the likeli- 
hood of model A under the data (tree) T. When aim- 
ing at modeling empirical data, we would say that 
one model A is better than a different model B, if 
La(T) > Lb(T) for an observed tree T. 

For the ERM model and the AB models defined in 
the previous section, the calculation of likelihoods is 
straight- forward: 

L a{T)= J2 PA(s(Mt(x))\s(x)) (24) 
xei(T) 

where A is the model under consideration, I(T) is 
the set of inner nodes of tree T, s(x) is the number 
of tips in the subtree with the root x and left(x) is 
the left child node of node x. For the age model 



it is not clear if a simple method of exact likelihood 
computation exists. Here we calculate the exact value 
of L age (T) by adding up probabilities of all branching 
orders leading to the observed tree T. Details are 
described in the Appendix. 

Figure 3 shows that the likelihoods of the age 
model and AB model are clearly correlated under 
the trees in the databases. The variation of likeli- 
hoods across trees of the same size n is smaller in the 
age model compared to that in the AB model. No- 
tably, the age model has larger likelihood than the 
AB model under more than half of the trees under 
consideration, so that it can be considered a better 
description of the evolutionary process. 

Concluding remarks 

The proposed age model compares with observed 
phylogenetic trees better than previous models. In 
addition, it describes the tree generation process in 
a way which is easy to interpret biologically: it as- 
sumes that lineages which have not speciated for a 
long time would display in the future a still more 
reduced speciation rate. 

Future work should provide a more detailed analy- 
sis of the model itself and further comparison to real 
phylogenetic trees. For the former, an analytic solu- 
tion for the expected depth of for its bounds is lack- 
ing. It would be also desirable to obtain expressions 
or at least numerical results for the second and per- 
haps higher moments. For the likelihood expressions, 
a factorization or other kind of decomposition would 
allow for faster exact computation. Instead of exact 
computation, estimation by a Monte-Carlo sampling 
method may circumvent the present size limitation of 
trees in the likelihood analysis. 

An additional interesting point of analysis and 
comparison of phylogenetic trees is the distribution of 
branch length. Branch length data, however, are not 
as reliable as the topological structure of phylogenetic 
trees [Barraclough and Nee, 2001]. This argument is 
supported by Pigolotti et al. [2005], summarizing the 
variety of behaviors of distributions found in the lit- 
erature. We believe that future studies in the line of 
Venditti et al. [2010] will accumulate sufficiently rc- 
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Figure 3: Comparison between age and AB models by likelihoods under tree shapes from databases, (a) 
log(i age (T)) versus log^AB^)) for each of the 541 tree shapes T with n = 20 tips in the database PANDIT. 
The dashed line is the identity, (b) Same as (a) for the 85 tree shapes with n = 20 tips in the database 
TreeBASE. (c) Fraction of trees T with L age (T) > Lab(T), separately for each n € {5, . . . , 20}. The overall 
fraction is 0.554 = 14088/25441 for PANDIT and 0.562 = 842/1499 for TreeBASE. The number of available 
tree instances is one order of magnitude smaller in TreeBASE than in PANDIT leading to larger fluctuations 
in the TreeBASE results. 



8 



liable branch-length data to allow for comparison to 
models such as the present one. 

Finally, timing in the model is worth further clarifi- 
cation. The model describes tree growth as a Markov 
chain where exactly one speciation event occurs at 
each time step. A more realistic version would formu- 
late a Markov process that assigns a speciation rate 
to each species at any moment in continuous time. 
The choice At = l/n in the results of Fig. 2 is a first 
step in that direction. 
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APPENDIX: Likelihood compu- 
tation 



The probability L agc (T) of the age model generating 
a given rooted binary tree shape T is calculated as 
follows. The nodes of the tree are assigned unique 
labels in {1, . . . , 2n — 1} := A, where the inner nodes 
have the labels {1, ... ,n — 1} := /. The root has 
label 1. For a non-root node i > 1, we denote the 
unique parent node by m(i). We call S the set of all 
permutations of /, so each element of 5* is a bijection 
s : I 4 I. Such a permutation is to encode a branch- 
ing order of a tree: s(i) is the time step at which 
node i branches. In a valid branching order, children 
cannot branch earlier than the parent. Thus we say 
that s <E S is compatible with T, if s(i) > s(m(i)) for 
each i e I \ {1}. We call S C (T) C S the set of com- 
patible permutations. When branching according to 
s e S C (T), the set of tips at time t > 1 is 



B(s,t) = {jel\{l}\s(m(j))<t<s(j)l25) 
U{j€A\I\ s(m(j)) < t} . 



The age of a tip j at time t > 1 is t — rn(j). Thus the 
age model generates the tree T with the branching 
order given by s € S c with probability 



The overall probability of generating T with the age 
model is obtained by summing over all branching or- 
ders generating T, 



n-l 



(s(i) — s(m(i)) 1 



p(s,t)=h 



E je s (s , sW )( s W-s(™0')) 1 



(26) 




(27) 



ses c (t) 
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